import numpy as np #linear algebra
import pandas as pd #data manipulation and analysis
import matplotlib.pyplot as plt #data visualization
import seaborn as sns #data visualization
import sklearn.preprocessing as skp #machine learning (preprocessing)
import sklearn.cluster as skc #machine learning (clustering)
import sklearn.metrics as metrics
from scipy.spatial import ConvexHull, convex_hull_plot_2d
import matplotlib.path as mpath
from adjustText import adjust_text
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from sklearn.metrics import classification_report
from sklearn.metrics import silhouette_score
import warnings # ignore warnings
warnings.filterwarnings('ignore')
path = "../../data/cleanData/popData.csv"
The K-Means clustering algorithm is a popular unsupervised machine learning algorithm used to segment data into distinct groups based on similarity. It is a simple yet effective algorithm that divides a dataset into k clusters, each represented by a centroid.
The algorithm works by iteratively minimizing the sum of squared distances between each data point and the centroid point of its assigned cluster. This process is achieved by assigning data points to their closest centroid and updating the centroid position based on the new cluster assignments.
The algorithm terminates when the centroids no longer move or after a specified number of iterations. Once the algorithm converges, each data point is assigned to its closest centroid, and the resulting clusters represent distinct groups in the data.
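The assignment and update steps described above can be sketched in a few lines of plain NumPy. This is an illustrative toy, not the scikit-learn implementation we use below; the function name and the two-blob demo data are made up for the example:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means sketch: assign points to the nearest centroid, then
    move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid from its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs -> the sketch should recover them as two clusters
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5, 0.1, (20, 2))])
labels, centroids = kmeans_sketch(X, k=2)
```

In practice we rely on `sklearn.cluster.KMeans`, which adds smarter initialization (`k-means++`) and multiple restarts on top of this same loop.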
Predictions will be based on demographic similarity. Below, we describe how we will use K-Means to find counties at risk of implementing State Bill 52.
K-Means clustering can identify counties with similar demographic profiles based on features such as age, gender, race, income, and education level. We gathered a dataset containing information about each county's demographic variables, and we can apply K-Means clustering to this dataset to group similar counties together based on their demographic similarities.
The first step is to decide on the number of clusters (k) we want to create. Then, we can use our domain knowledge from earlier this semester and statistical methods to determine the optimal number of clusters. Once we have decided on the number of clusters, we can apply the K-Means algorithm to the dataset, with each county represented as a data point in a high-dimensional feature space.
The algorithm will group the counties based on their demographic similarities by minimizing the sum of squared distances between the centroid of each cluster and the data points assigned to it. The resulting clusters will represent groups of counties that have similar demographic characteristics.
We can then analyze each cluster to understand the demographic characteristics of the counties in it and use this information to identify patterns and trends. For example, we may find that counties in one cluster have a higher percentage of elderly residents with a lower income and education level. In comparison, counties in another cluster have a higher percentage of younger residents with a higher income and education level.
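This kind of per-cluster analysis boils down to averaging each demographic variable within each cluster. A minimal sketch of that step, using a made-up toy frame and a hypothetical `cluster_id` label column like the one we attach to our county data later:

```python
import pandas as pd

# Toy frame standing in for the county data; column names and values are
# invented for illustration only
profile_df = pd.DataFrame({
    'median_income': [42000, 45000, 78000, 81000],
    'pct_over_65':   [0.22, 0.20, 0.12, 0.11],
    'cluster_id':    [0, 0, 1, 1],
})

# Mean of each demographic variable per cluster: the "profile" of each group
cluster_profile = profile_df.groupby('cluster_id').mean()
print(cluster_profile)
```

Comparing the rows of such a profile table is how one would spot patterns like "cluster 0 is older and lower-income than cluster 1."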
We will read in our subset of the Census data, which contains only demographic variables related to the population, such as age, race, and language, that may help our analysis.
df = pd.read_csv(path, index_col=0)
df.head()
| County Name | Population EstimatesJuly 1 2021() | Persons under 5 years | Persons under 18 years | Persons 65 years and over | Female persons | White alone | Black or African American alone | American Indian and Alaska Native alone | Asian alone | Native Hawaiian and Other Pacific Islander alone | Two or More Races | Hispanic or Latino | White alonenot Hispanic or Latino | Veterans2017-2021 | Foreign born persons2017-2021 | Banned or not | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adams County, Ohio | 27542 | 1707.60 | 6665.16 | 5095.27 | 13798.54 | 26660.66 | 165.25 | 165.25 | 82.63 | 0.00 | 440.67 | 302.96 | 26385.24 | 1840 | 192.79 | 0.0 |
| 1 | Allen County, Ohio | 101670 | 6100.20 | 23587.44 | 18503.94 | 50021.64 | 84081.09 | 12912.09 | 305.01 | 915.03 | 0.00 | 3456.78 | 3761.79 | 81132.66 | 6098 | 1830.06 | 1.0 |
| 2 | Ashland County, Ohio | 52316 | 2929.70 | 11666.47 | 10201.62 | 26524.21 | 50484.94 | 470.84 | 156.95 | 418.53 | 52.32 | 784.74 | 837.06 | 49752.52 | 3076 | 784.74 | 0.0 |
| 3 | Ashtabula County, Ohio | 97337 | 5450.87 | 21414.14 | 19467.40 | 47889.80 | 90231.40 | 3796.14 | 389.35 | 486.68 | 97.34 | 2433.43 | 4672.18 | 86240.58 | 7158 | 1557.39 | 0.0 |
| 4 | Athens County, Ohio | 62056 | 2296.07 | 8998.12 | 8874.01 | 31090.06 | 56657.13 | 1799.62 | 248.22 | 1737.57 | 62.06 | 1613.46 | 1241.12 | 55726.29 | 3255 | 2544.30 | 0.0 |
Scaling is essential when using the K-Means clustering algorithm because K-Means groups points using Euclidean distance (the straight-line distance between two points), which is sensitive to the scale of the variables. If the variables are not scaled appropriately, variables with larger values will dominate the distance calculations at the expense of those with smaller values.
Furthermore, scaling allows for a more meaningful interpretation of the clustering results: variables with different scales often have different units, making it difficult to compare their relative importance in determining the clustering structure.
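A tiny demonstration of this scale sensitivity, with two made-up rows where a population-sized column swamps a rate-sized column until both are standardized:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical counties: a large gap in population (big numbers) and a
# large gap in a rate-like variable (small numbers); values are invented
X = np.array([[1_000_000, 0.10],
              [  500_000, 0.90]], dtype=float)

# Unscaled: the Euclidean distance is dominated entirely by the population column
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardization, each column contributes comparably to the distance
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(raw_dist, scaled_dist)
```

Before scaling, the distance is ~500,000 regardless of the second column; after scaling, both columns contribute equally, which is exactly why we standardize before clustering below.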
# First we start off removing 'County Name' and 'Banned or not' columns before scaling
cluster_df = df.iloc[:, 1:-1]
cluster_df.head()
| Population EstimatesJuly 1 2021() | Persons under 5 years | Persons under 18 years | Persons 65 years and over | Female persons | White alone | Black or African American alone | American Indian and Alaska Native alone | Asian alone | Native Hawaiian and Other Pacific Islander alone | Two or More Races | Hispanic or Latino | White alonenot Hispanic or Latino | Veterans2017-2021 | Foreign born persons2017-2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27542 | 1707.60 | 6665.16 | 5095.27 | 13798.54 | 26660.66 | 165.25 | 165.25 | 82.63 | 0.00 | 440.67 | 302.96 | 26385.24 | 1840 | 192.79 |
| 1 | 101670 | 6100.20 | 23587.44 | 18503.94 | 50021.64 | 84081.09 | 12912.09 | 305.01 | 915.03 | 0.00 | 3456.78 | 3761.79 | 81132.66 | 6098 | 1830.06 |
| 2 | 52316 | 2929.70 | 11666.47 | 10201.62 | 26524.21 | 50484.94 | 470.84 | 156.95 | 418.53 | 52.32 | 784.74 | 837.06 | 49752.52 | 3076 | 784.74 |
| 3 | 97337 | 5450.87 | 21414.14 | 19467.40 | 47889.80 | 90231.40 | 3796.14 | 389.35 | 486.68 | 97.34 | 2433.43 | 4672.18 | 86240.58 | 7158 | 1557.39 |
| 4 | 62056 | 2296.07 | 8998.12 | 8874.01 | 31090.06 | 56657.13 | 1799.62 | 248.22 | 1737.57 | 62.06 | 1613.46 | 1241.12 | 55726.29 | 3255 | 2544.30 |
# Scaling the new data frame for clustering
sc = skp.StandardScaler()
cluster_scale = np.array(cluster_df)
scaled = sc.fit_transform(cluster_scale.astype(float))
scaled_cluster = pd.DataFrame(scaled, columns=cluster_df.columns)
scaled_cluster.head()
| Population EstimatesJuly 1 2021() | Persons under 5 years | Persons under 18 years | Persons 65 years and over | Female persons | White alone | Black or African American alone | American Indian and Alaska Native alone | Asian alone | Native Hawaiian and Other Pacific Islander alone | Two or More Races | Hispanic or Latino | White alonenot Hispanic or Latino | Veterans2017-2021 | Foreign born persons2017-2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.483125 | -0.447412 | -0.470140 | -0.519097 | -0.477279 | -0.555122 | -0.298633 | -0.361370 | -0.336760 | -0.443469 | -0.438945 | -0.406408 | -0.562996 | -0.525603 | -0.311825 |
| 1 | -0.146288 | -0.116788 | -0.123544 | -0.148065 | -0.157249 | -0.166313 | -0.081841 | -0.148925 | -0.256107 | -0.443469 | 0.001014 | -0.149526 | -0.165779 | -0.135148 | -0.229159 |
| 2 | -0.370552 | -0.355427 | -0.367705 | -0.377799 | -0.364848 | -0.393802 | -0.293436 | -0.373987 | -0.304214 | -0.183195 | -0.388756 | -0.366741 | -0.393456 | -0.412263 | -0.281937 |
| 3 | -0.165977 | -0.165662 | -0.168057 | -0.121405 | -0.176084 | -0.124667 | -0.236881 | -0.020722 | -0.297611 | 0.040764 | -0.148262 | -0.081913 | -0.128719 | -0.037947 | -0.242926 |
| 4 | -0.326294 | -0.403119 | -0.422357 | -0.414535 | -0.324509 | -0.352008 | -0.270836 | -0.235250 | -0.176410 | -0.134741 | -0.267871 | -0.336732 | -0.350114 | -0.395849 | -0.193097 |
The elbow plot is a graphical method to determine the optimal number of clusters in a k-means clustering algorithm. It plots the within-cluster sum of squares (WSS) against the number of clusters. The WSS score is a way to quantify how well a clustering algorithm can group similar data points. It tells us how far each point is from its assigned group center and how similar the points within each group are. A lower WSS score means that the points within each group are similar and that the groups are more compact and well-defined.
### Decide n_clusters using the Elbow Method
wss = []
k_range = range(1, 12)
for i in k_range:
    kmeans = skc.KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(scaled_cluster)
    wss.append(kmeans.inertia_)

fig, ax = plt.subplots(figsize=(8, 6), dpi=80)
plt.plot(k_range, wss, marker='o')
for i, value in enumerate(wss):
    ax.text(i + 1.05, value, round(value, 1), fontsize=12, fontweight='bold')
plt.xticks(k_range)
# plt.grid()
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WSS')
# plt.savefig('elbow_methodA.png')
plt.show()
The elbow plot shows a decreasing trend in the WSS score as the number of clusters increases. The critical point to look for is the "elbow" in the graph, where the rate of decrease in the WSS score starts to level off; the point is called the elbow because the curve resembles an arm bending at the elbow joint. The optimal number of clusters is typically the value at the elbow point. If the elbow is not clear-cut, it is recommended to choose a value that balances a low WSS score against not using too many clusters, to prevent overfitting.
As we can see, the "elbow" in the graph suggests that the optimal number of clusters would be around 2 or 3.
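Eyeballing the elbow can also be backed by a simple heuristic: stop adding clusters once the next cluster improves WSS by less than some cutoff. This is a sketch, not a standard algorithm; the 10% threshold and the demo WSS curve are assumptions made up for illustration:

```python
import numpy as np

def pick_elbow(k_values, wss, threshold=0.10):
    """Return the smallest k whose next step reduces WSS by less than
    `threshold` (a fraction of the current WSS). The threshold is an
    arbitrary assumption, not part of K-Means itself."""
    wss = np.asarray(wss, dtype=float)
    # Fractional improvement gained by each additional cluster
    rel_drop = (wss[:-1] - wss[1:]) / wss[:-1]
    for k, drop in zip(k_values, rel_drop):
        if drop < threshold:
            return k
    return k_values[-1]

# Made-up WSS curve that flattens after k = 3
wss_demo = [1000, 420, 180, 165, 158, 154]
print(pick_elbow(range(1, 7), wss_demo))  # elbow at k = 3
```

Applied to the real `wss` list computed above, this would give a numeric sanity check on the visual choice of 2 or 3 clusters.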
The silhouette method is used to evaluate the quality of clustering results. It measures how well each data point in a cluster is separated from data points in other clusters. The silhouette score ranges from -1 to 1, where a score closer to 1 indicates that the data point is well matched to its own cluster and poorly matched to neighboring clusters. Here, a declining score as the number of clusters increases indicates that the additional clusters are weaker and less well separated.
# Define the range of k values to try
k_range = range(2, 11)

# Define an empty list to store the silhouette scores for each k value
silhouette_scores = []

# Loop over the range of k values
for k in k_range:
    # Fit a KMeans model with the current k value
    kmeans = skc.KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_cluster)

    # Calculate the silhouette score for the current clustering
    silhouette_avg = silhouette_score(scaled_cluster, kmeans.labels_)

    # Append the silhouette score to the list of scores
    silhouette_scores.append(silhouette_avg)

    # Print the current k value and silhouette score
    print(f"k = {k}, silhouette score = {silhouette_avg:.3f}")
k = 2, silhouette score = 0.871
k = 3, silhouette score = 0.779
k = 4, silhouette score = 0.646
k = 5, silhouette score = 0.636
k = 6, silhouette score = 0.601
k = 7, silhouette score = 0.598
k = 8, silhouette score = 0.508
k = 9, silhouette score = 0.469
k = 10, silhouette score = 0.447
In the output provided, we have the silhouette scores for a range of values of k, where k represents the number of clusters:
All of the k values shown produce reasonably well-separated clusters, and k = 2 has the highest silhouette score. However, it is essential to also consider other factors, such as the interpretability and usefulness of the clusters for the task at hand, when choosing the number of clusters.
Although the silhouette score for k = 3 is lower than for k = 2, we choose k = 3 for this analysis: counties implementing the SB 52 bill have either a full or a partial ban, suggesting that the data should be grouped into three distinct clusters.
# Clustering K Means, K=3
kmeans_3 = skc.KMeans(n_clusters=3,random_state=42)
kmeans_3.fit(scaled_cluster)
kmeans_3.labels_
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,
0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
dtype=int32)
# Assign the clustering result to each county in the data frame
cluster_df['cluster_id'] = kmeans_3.labels_
cluster_df.head()
| Population EstimatesJuly 1 2021() | Persons under 5 years | Persons under 18 years | Persons 65 years and over | Female persons | White alone | Black or African American alone | American Indian and Alaska Native alone | Asian alone | Native Hawaiian and Other Pacific Islander alone | Two or More Races | Hispanic or Latino | White alonenot Hispanic or Latino | Veterans2017-2021 | Foreign born persons2017-2021 | cluster_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27542 | 1707.60 | 6665.16 | 5095.27 | 13798.54 | 26660.66 | 165.25 | 165.25 | 82.63 | 0.00 | 440.67 | 302.96 | 26385.24 | 1840 | 192.79 | 0 |
| 1 | 101670 | 6100.20 | 23587.44 | 18503.94 | 50021.64 | 84081.09 | 12912.09 | 305.01 | 915.03 | 0.00 | 3456.78 | 3761.79 | 81132.66 | 6098 | 1830.06 | 0 |
| 2 | 52316 | 2929.70 | 11666.47 | 10201.62 | 26524.21 | 50484.94 | 470.84 | 156.95 | 418.53 | 52.32 | 784.74 | 837.06 | 49752.52 | 3076 | 784.74 | 0 |
| 3 | 97337 | 5450.87 | 21414.14 | 19467.40 | 47889.80 | 90231.40 | 3796.14 | 389.35 | 486.68 | 97.34 | 2433.43 | 4672.18 | 86240.58 | 7158 | 1557.39 | 0 |
| 4 | 62056 | 2296.07 | 8998.12 | 8874.01 | 31090.06 | 56657.13 | 1799.62 | 248.22 | 1737.57 | 62.06 | 1613.46 | 1241.12 | 55726.29 | 3255 | 2544.30 | 0 |
cluster_df['Banned or not'] = df.iloc[:,-1]
cluster_df['County Name'] = df.iloc[:,0]
cluster_df.head()
| Population EstimatesJuly 1 2021() | Persons under 5 years | Persons under 18 years | Persons 65 years and over | Female persons | White alone | Black or African American alone | American Indian and Alaska Native alone | Asian alone | Native Hawaiian and Other Pacific Islander alone | Two or More Races | Hispanic or Latino | White alonenot Hispanic or Latino | Veterans2017-2021 | Foreign born persons2017-2021 | cluster_id | Banned or not | County Name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27542 | 1707.60 | 6665.16 | 5095.27 | 13798.54 | 26660.66 | 165.25 | 165.25 | 82.63 | 0.00 | 440.67 | 302.96 | 26385.24 | 1840 | 192.79 | 0 | 0.0 | Adams County, Ohio |
| 1 | 101670 | 6100.20 | 23587.44 | 18503.94 | 50021.64 | 84081.09 | 12912.09 | 305.01 | 915.03 | 0.00 | 3456.78 | 3761.79 | 81132.66 | 6098 | 1830.06 | 0 | 1.0 | Allen County, Ohio |
| 2 | 52316 | 2929.70 | 11666.47 | 10201.62 | 26524.21 | 50484.94 | 470.84 | 156.95 | 418.53 | 52.32 | 784.74 | 837.06 | 49752.52 | 3076 | 784.74 | 0 | 0.0 | Ashland County, Ohio |
| 3 | 97337 | 5450.87 | 21414.14 | 19467.40 | 47889.80 | 90231.40 | 3796.14 | 389.35 | 486.68 | 97.34 | 2433.43 | 4672.18 | 86240.58 | 7158 | 1557.39 | 0 | 0.0 | Ashtabula County, Ohio |
| 4 | 62056 | 2296.07 | 8998.12 | 8874.01 | 31090.06 | 56657.13 | 1799.62 | 248.22 | 1737.57 | 62.06 | 1613.46 | 1241.12 | 55726.29 | 3255 | 2544.30 | 0 | 0.0 | Athens County, Ohio |
# Save data
cluster_df.to_csv('../viz/popClusterData.csv')
# Recode the 0/1 'Banned or not' indicator as 'no'/'yes' for plotting
cluster_df['banned'] = cluster_df['Banned or not'].replace({0: 'no', 1: 'yes'})
cluster_df['cluster'] = cluster_df['cluster_id'].replace({0: '0', 1: '1', 2:'2'})
fig = px.histogram(cluster_df, x='cluster', color='cluster', pattern_shape="banned")
fig.show()
# TODO: edit x- and y-axis labels
The plot shows the number of counties in each cluster; the patterned portion of each bar marks that cluster's banned counties.
Because we did not use KMeans clustering in Progress Report 3, we will not provide a scatter plot with ground truth and KMeans classification. Instead, we will evaluate our cluster results and see how accurate our model is.
y_true = cluster_df['Banned or not']
y_pred = cluster_df['cluster_id']
cluster_names = ['cluster 0', 'cluster 1', 'cluster 2']
print(classification_report(y_true, y_pred, target_names=cluster_names))
precision recall f1-score support
cluster 0 0.89 0.90 0.89 78
cluster 1 0.14 0.10 0.12 10
cluster 2 0.00 0.00 0.00 0
accuracy 0.81 88
macro avg 0.34 0.33 0.34 88
weighted avg 0.80 0.81 0.80 88
This report shows the results of a k-means clustering model.
Here is a quick explanation of what each metric means and how it is used to evaluate the performance of the model:
For cluster 0, the precision is 0.89, indicating that 89% of the data points that the model predicted as cluster 0 were actually in cluster 0. The recall is 0.90, indicating that the model correctly identified 90% of the data points in cluster 0. The F1-score is 0.89, which is a harmonic mean of precision and recall, representing the model's overall performance on this cluster.
For cluster 1, the precision is 0.14, indicating that 14% of the data points the model predicted as cluster 1 were actually in cluster 1. The recall is 0.10, indicating that the model correctly identified 10% of the data points in cluster 1. The F1-score is 0.12, which is a harmonic mean of precision and recall, representing the model's overall performance on this cluster.
For cluster 2, the precision, recall, and F1-score are all 0 because no counties have a ground-truth label of 2 (support = 0); the ground truth only distinguishes banned (1) from not banned (0). The model's overall accuracy is 0.81, meaning it correctly classified 81% of the counties.
The model performed well on cluster 0 but poorly on cluster 1, and cluster 2 has no ground-truth counties at all. Nevertheless, the weighted-average F1-score of 0.80 and the accuracy of 81% suggest that the model's overall performance is satisfactory, though this is driven almost entirely by the large majority class (cluster 0).
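One caveat when scoring clusters this way: K-Means labels are arbitrary integers, so comparing `cluster_id` directly against `Banned or not` only works if the numbering happens to line up. A common remedy, sketched here on made-up toy labels, is to relabel each cluster by the majority true class inside it before computing accuracy:

```python
import pandas as pd

# Toy ground truth and cluster assignments (invented for illustration);
# note the clusters are the "right" groups but numbered the opposite way
y_true = pd.Series([0, 0, 0, 1, 1, 0])
y_clust = pd.Series([1, 1, 1, 0, 0, 1])

# Cross-tabulate clusters (rows) against the ground truth (columns)
ct = pd.crosstab(y_clust, y_true)

# Majority-vote mapping: each cluster -> its most common true label
mapping = ct.idxmax(axis=1).to_dict()
y_mapped = y_clust.map(mapping)

accuracy = (y_mapped == y_true).mean()
print(mapping, accuracy)
```

Applied to our real `cluster_id` and `Banned or not` columns, this kind of remapping would give a fairer upper bound on how well the clusters track the ban status.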
In the 'popViz.ipynb' file, we will create a convex hull around the banned counties in each cluster and look more closely at which demographic variables the banned counties have most in common.